Syntactic methods for topic-independent authorship attribution
Abstract
The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are ‘deep’ in the sense that they are derived by parsing the subject texts, in contrast to ‘shallow’ syntactic features for which a part-of-speech analysis is enough. The experiments are conducted on a corpus of novels written around the year 1900 by 20 different authors, and cover two tasks. In the first task, text samples are taken from books by one author, and the goal is to pair samples from the same book. In the second task, text samples are taken from several authors, but only one sample from each book, and the goal is to pair samples from the same author. In the first task, the baseline feature set outperformed the syntax-based feature set, but for the second task, the outcome was the opposite. This suggests that, compared to lexical features such as vocabulary and punctuation, syntactic features are more robust to changes in topic.
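The baseline and the pairing tasks described above can be illustrated with a minimal sketch. It builds a feature vector of relative frequencies of words and punctuation marks for each text sample, then pairs each sample with its most similar counterpart. The abstract does not say how samples are compared or paired, so the cosine similarity measure, the nearest-neighbour pairing, the tokenizer, and all function names here (`relative_frequencies`, `cosine_similarity`, `best_match`) are assumptions made for illustration, not the paper's method.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: a token is a run of word characters or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def relative_frequencies(text):
    """Map each word or punctuation mark to its relative frequency in the text."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(samples):
    """For each sample name, return the name of the most similar other sample.

    `samples` maps a sample name to its text and must hold at least two entries.
    """
    vectors = {name: relative_frequencies(text) for name, text in samples.items()}
    matches = {}
    for name, vector in vectors.items():
        scored = [
            (cosine_similarity(vector, other_vector), other_name)
            for other_name, other_vector in vectors.items()
            if other_name != name
        ]
        matches[name] = max(scored)[1]
    return matches
```

In the first task, the samples passed to `best_match` would all come from one author and a correct match is a sample from the same book. In the second, each sample would come from a different book and a correct match is a sample by the same author. A syntax-based variant would swap `relative_frequencies` for features derived from parse trees, which this sketch does not attempt.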
Similar resources
Syntactic Stylometry: Using Sentence Structure for Authorship Attribution
Most approaches to statistical stylometry have concentrated on lexical features, such as relative word frequencies or type-token ratios. Syntactic features have been largely ignored. This work attempts to fill that void by introducing a technique for authorship attribution based on dependency grammar. Syntactic features are extracted from texts using a common dependency parser, and those featur...
Lost in Translation: Authorship Attribution using Frame Semantics
We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attri...
Domain Independent Authorship Attribution without Domain Adaptation
Automatic authorship attribution, by its nature, is much more advantageous if it is domain (i.e., topic and/or genre) independent. That is, many real world problems that require authorship attribution may not have in-domain training data readily available. However, most previous work based on machine learning techniques focused only on in-domain text for authorship attribution. In this paper, w...
Automatic Authorship Detection Using Textual Patterns Extracted from Integrated Syntactic Graphs
We apply the integrated syntactic graph feature extraction methodology to the task of automatic authorship detection. This graph-based representation allows integrating different levels of language description into a single structure. We extract textual patterns based on features obtained from shortest path walks over integrated syntactic graphs and apply them to determine the authors of docume...
Shallow Text Analysis and Machine Learning for Authorship Attribution
Current advances in shallow parsing and machine learning allow us to use results from these fields in a methodology for Authorship Attribution. We report on experiments with a corpus that consists of newspaper articles about national current affairs by different journalists from the Belgian newspaper De Standaard. Because the documents are in a similar genre, register, and range of topics, toke...
Journal title:
- Natural Language Engineering
Volume 23
Publication date: 2017